{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "634c2168",
   "metadata": {},
   "source": [
    "Copyright (c) 2021, salesforce.com, inc.\\\n",
    "All rights reserved.\\\n",
    "SPDX-License-Identifier: BSD-3-Clause\\\n",
    "For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cb0fb337",
   "metadata": {},
   "source": [
    "**Try this notebook on [Colab](http://colab.research.google.com/github/salesforce/warp-drive/blob/master/tutorials/tutorial-1-warp_drive_basics.ipynb)!**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c4ebaa0",
   "metadata": {},
   "source": [
    "## ⚠️ PLEASE NOTE:\n",
    "This notebook runs on a GPU runtime.\\\n",
    "If running on Colab, choose Runtime > Change runtime type from the menu, then select 'GPU' in the dropdown."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ccd24cf5",
   "metadata": {},
   "source": [
    "# Welcome to WarpDrive!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c4bb0a9",
   "metadata": {},
   "source": [
    "This tutorial is a first in a series of introduction notebooks for WarpDrive, a [PyCUDA](https://documen.tician.de/pycuda/)-based framework for extremely parallelized multi-agent reinforcement learning (RL) on a single graphics processing unit (GPU).\n",
    "\n",
    "In this tutorial, we describe how we harness the GPU's ability to parallelize operations across a large number of RL agents and multiple environment replicas. \n",
    "\n",
    "In conjunction with training logic using Pytorch, we can perform extremely fast end-to-end training of multiple RL agents, all on a single GPU."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "09a2ea57",
   "metadata": {},
   "source": [
    "# GPU basics and terminology"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5acd00ec",
   "metadata": {},
   "source": [
    "Before we dive into WarpDrive, let's review some GPU basics. \n",
    "\n",
    "All programs that run on a GPU need to be triggered via a CPU. Commonly, the CPU is known as the *host* and the GPU as the *device*. [CUDA](https://developer.nvidia.com/cuda-zone) (Compute Unified Device Architecture) is an extension of C that implement code to be run on (CUDA-enabled) GPU hardware.\n",
    "\n",
    "| ![gpu_memory_model](https://github.com/salesforce/warp-drive/blob/master/tutorials/assets/gpu_memory_model.png?raw=true) |\n",
    "|:--:|\n",
    "| <b>Fig. 1 The CUDA memory model</b>|\n",
    "\n",
    "CUDA launches several `threads` in parallel. It organizes threads into a group called `thread block`. Additionally, the CUDA kernel can launch multiple thread blocks, organized into a `grid` structure. \n",
    "\n",
    "Therefore, a CUDA kernel runs a grid of blocks of threads.\n",
    "\n",
    "CUDA also provides built-in variables for accessing thread information - three key variables are\n",
    "\n",
    "1. `threadIdx.x`: contains the index of the thread within the thread block.\n",
    "2. `blockIdx.x`: the index of the thread block.\n",
    "3. `blockDim.x` contains the size of thread block (number of threads in the thread block). \n",
    "\n",
    "Each CUDA card has a maximum number of threads in a block (typically, 512, 1024, or 2048).\n",
    "\n",
    "*Note: In general, threads, blocks, and grids are multidimensional, i.e., they can also have threadIdx.y and z dimensions etc. We will not go into that here.*"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c78d17d",
   "metadata": {},
   "source": [
    "## Dependencies"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "783d6fae",
   "metadata": {},
   "source": [
    "You can install the warp_drive package using\n",
    "\n",
    "- the pip package manager, OR\n",
    "- by cloning the warp_drive package and installing the requirements.\n",
    "\n",
    "On Colab, we will do the latter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60425a1c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "IN_COLAB = 'google.colab' in sys.modules\n",
    "\n",
    "if IN_COLAB:\n",
    "    ! git clone https://github.com/salesforce/warp-drive.git \n",
    "    % cd warp-drive\n",
    "    ! pip install -e .\n",
    "else:\n",
    "    ! pip install rl_warp_drive"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a0619fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c2d6341",
   "metadata": {},
   "outputs": [],
   "source": [
    "import warp_drive\n",
    "from warp_drive.managers.data_manager import CUDADataManager\n",
    "from warp_drive.managers.function_manager import (\n",
    "    CUDAFunctionManager, CUDALogController, CUDASampler, CUDAEnvironmentReset\n",
    ")\n",
    "from warp_drive.utils.data_feed import DataFeed"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5383ca2",
   "metadata": {},
   "source": [
    "# WarpDrive Design Principles"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c5f145c",
   "metadata": {},
   "source": [
    "Modern RL architectures ([SEED RL](https://arxiv.org/pdf/1910.06591.pdf), [ACME](https://arxiv.org/pdf/2006.00979.pdf), [IMPALA](https://arxiv.org/pdf/1802.01561.pdf), [ELF](https://arxiv.org/pdf/1707.01067.pdf), [MAVA](https://arxiv.org/pdf/2107.01460.pdf)) comprise several rollout CPU actors and a CPU/GPU trainer actor operating in tandem. While these architectures are very scalable, they may suffer from expensive communication between the CPU / GPU actors, which can lead to inefficient resource utilization. \n",
    "\n",
    "By moving both rollout generation and training exclusively to the GPU, we minimize the communication cost. In that case, all data is in the GPU's memory, and only (optional) training inspection requires a data transfer from the host to the device. We also minimize latency by having the rollout generation, batching, training and action inference all occur on the same device.\n",
    "\n",
    "Running end-to-end on a GPU is even more scalable for multi-agent RL. In essence, we can parallelize *rollouts* by having each agent operate individually on a separate thread, and each environment operate individually on a separate thread block. This results in extremely high training throughput. \n",
    "\n",
    "The figure below depicts our architecture block diagram; we will introduce `DataManager` and `FunctionManager` shortly.\n",
    "\n",
    "| ![](https://github.com/salesforce/warp-drive/blob/master/tutorials/assets/warpdrive_framework_overview.png?raw=true) |\n",
    "|:--:|\n",
    "| <b>Fig. 2 End-to-end multi-agent RL on a single GPU. Each GPU thread handles an agent, and each GPU block handles an environment. WarpDrive's DataManager and FunctionManager help manage the communication between the CPU and GPU and invoke the GPU kernel calls from the CPU, respectively.</b>|"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3052c05e",
   "metadata": {},
   "source": [
    "### PyCUDA"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f3daa9b9",
   "metadata": {},
   "source": [
    "Because most modern day programming is performed with Python, we have developed WarpDrive using [PyCUDA](https://documen.tician.de/pycuda/), a Python programming environment for CUDA.\n",
    "PyCUDA essentially provides additional wrappers on CUDA for easy Python access to CUDA APIs.\n",
    "\n",
    "To execute any PyCUDA program, there are three main steps:\n",
    "\n",
    "1. Copy the input data from host memory to device memory, also called *host-to-device transfer*.\n",
    "2. Load CUDA functions and execute, *caching data on-chip* for performance.\n",
    "3. Copy the results data from device memory to host memory, also called *device-to-host transfer*."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "04351c27",
   "metadata": {},
   "source": [
    "### Data and Function Managers"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb8f3d20",
   "metadata": {},
   "source": [
    "Following the three steps above, we developed WarpDrive with the following two modules\n",
    "\n",
    "1. a **data manager** to handle all the data transfers between the host and the device. It also handles creating and managing the data for multiple environment replicas and time steps.\n",
    "2. a **function manager** to load the CUDA programs and execute."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fd49ae59",
   "metadata": {},
   "source": [
    "In the following, we will demonstrate how to push and pull data between the host and the device, and how to write simple CUDA functions to manipulate the date. Let's begin by creating a CUDADataManager object.\n",
    "\n",
    "We specify a few multi-agent RL parameters in the `DataManager` creator. \n",
    "\n",
    "We'll create a multi-agent RL environment with 3 agents, an episode length of 5, and 2 environment replicas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4f18caec",
   "metadata": {},
   "outputs": [],
   "source": [
    "num_agents = 3\n",
    "num_envs = 2\n",
    "episode_length = 5\n",
    "\n",
    "cuda_data_manager = CUDADataManager(\n",
    "    num_agents,\n",
    "    num_envs,\n",
    "    episode_length\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dfdc056b",
   "metadata": {},
   "source": [
    "Now, let's create some (random) data that we would like to push to the device. In the context of RL, this can pertain to the starting states created by `env reset()`. \n",
    "\n",
    "The starting states are arrays that need to hold data such as observations, actions and rewards during the course of the episode. They could also contain environment configuration settings and hyperparameters. \n",
    "\n",
    "Each environment and agent will have its own data, so we create a `(num_envs, num_agents)`-shaped array that will be pushed to the GPU."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "801b8ecf",
   "metadata": {},
   "outputs": [],
   "source": [
    "random_data = np.random.rand(num_envs, num_agents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cd3d29f2",
   "metadata": {},
   "outputs": [],
   "source": [
    "random_data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e624046",
   "metadata": {},
   "source": [
    "## Push and pull data from host (CPU) to device (GPU)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25155936",
   "metadata": {},
   "source": [
    "In order to push data to the device, we have created a **DataFeed** helper object. For all data pushed from the host to device, we will need to provide a name identifier, the actual data, and two flags (both default to False):\n",
    "\n",
    "- `save_copy_and_apply_at_reset` - if `True`, we make a copy of the starting data so that we can set the data array to that value at every environment reset, and\n",
    "- `log_data_across_episode` - if `True`, we add a time dimension to the data, of size `episode_length`, set all $t>0$ index values to zeros, and store the data array at each time step separately. This is primarily used for logging the data for an episode rollout."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a8e6a52",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_feed = DataFeed()\n",
    "data_feed.add_data(\n",
    "    name=\"random_data\",\n",
    "    data=random_data,\n",
    "    save_copy_and_apply_at_reset=False,\n",
    "    log_data_across_episode=False\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5e7149f4",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "data_feed"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11759f9c",
   "metadata": {},
   "source": [
    "The CUDA data manager provides the **push_data_to_device()** and **pull_data_from_device()** apis to handle data transfer between the host and the device."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "49796da1",
   "metadata": {},
   "outputs": [],
   "source": [
    "cuda_data_manager.push_data_to_device(data_feed)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f9d31873",
   "metadata": {},
   "source": [
    "Notice that the data manager casted the data from float64 to float32. CUDA always uses 32-bit floating or integer representations of numbers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c4a2a75e",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_fetched_from_device = cuda_data_manager.pull_data_from_device(\"random_data\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58062fdf",
   "metadata": {},
   "source": [
    "The data fetched from the device matches the data pushed (the small differences are due to type-casting)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "32efb412",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_fetched_from_device"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7259d3be",
   "metadata": {},
   "source": [
    "Another integral part of RL is training. We also need to hold the observations, actions and rewards arrays. So fo training, we will wrap the data into a Pytorch Tensor."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0532976f",
   "metadata": {},
   "source": [
    "## Making Training Data Accessible To PyTorch"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9122000a",
   "metadata": {},
   "source": [
    "Note that pushing and pulling data several times between the host and the device causes a lot of communication overhead. So, it's advisable that we push the data from the host to device only once, and then manipulate all the data on the GPU in-place. This is particularly important when data needs to be accessed frequently. A common example is the batch of observations and rewards gathered for each training iteration. \n",
    "\n",
    "Fortunately, our framework lets Pytorch access the data we pushed onto the GPU via pointers with minimal overhead. To make data accessible by Pytorch, we set the `torch_accessible` flag to True."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "428a9034",
   "metadata": {},
   "outputs": [],
   "source": [
    "tensor_feed = DataFeed()\n",
    "tensor_feed.add_data(\n",
    "    name=\"random_tensor\",\n",
    "    data=random_data\n",
    ")\n",
    "\n",
    "cuda_data_manager.push_data_to_device(tensor_feed, torch_accessible=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "82298d59",
   "metadata": {},
   "outputs": [],
   "source": [
    "tensor_on_device = cuda_data_manager.data_on_device_via_torch(\"random_tensor\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c6ff0a3b",
   "metadata": {},
   "source": [
    "## Time comparison for data pull (`torch_accessible` True versus False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c7660434",
   "metadata": {},
   "outputs": [],
   "source": [
    "large_array = np.random.rand(1000, 1000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1e93dbbd",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_feed = DataFeed()\n",
    "data_feed.add_data(\n",
    "    name=\"large_array\",\n",
    "    data=large_array,\n",
    ")\n",
    "cuda_data_manager.push_data_to_device(data_feed)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ceadbd7",
   "metadata": {},
   "outputs": [],
   "source": [
    "timeit('data_fetched_from_device = cuda_data_manager.pull_data_from_device(\"large_array\")')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8c7e93e",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_feed = DataFeed()\n",
    "data_feed.add_data(\n",
    "    name=\"large_array_torch\",\n",
    "    data=large_array,\n",
    ")\n",
    "cuda_data_manager.push_data_to_device(data_feed, torch_accessible=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2ed4dcc",
   "metadata": {},
   "outputs": [],
   "source": [
    "timeit('tensor_on_device = cuda_data_manager.data_on_device_via_torch(\"random_tensor\")')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e4b5691",
   "metadata": {},
   "source": [
    "Currently, the `DataManager` supports primitive data types, such as ints, floats, lists, and arrays. If you would like to push more sophisticated data structures or types to the GPU, such as dictionaries, you may do so by pushing / pulling each key-value pair as a separate array."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "06e88c4a",
   "metadata": {},
   "source": [
    "## Code Execution Inside CUDA"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c488ff83",
   "metadata": {},
   "source": [
    "Once we push all the relevant data to the GPU, we will need to write functions to manipulate the data. To this end, we will need to write code in CUDA C, but invoke it from the host node. The `FunctionManager` is built to facilitate function initialization on the host and execution on the device. As we mentioned before, all the arrays on GPU will be modified on the GPU, and in-place. Let's begin by creating a CUDAFunctionManager object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "016d62c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "cuda_function_manager = CUDAFunctionManager(\n",
    "    num_agents=cuda_data_manager.meta_info(\"n_agents\"),\n",
    "    num_envs=cuda_data_manager.meta_info(\"n_envs\")\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2ce4cb8",
   "metadata": {},
   "source": [
    "### Array manipulation inside CUDA"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0af2319f",
   "metadata": {},
   "source": [
    "The benefit of GPU processing comes from the fact that we can parallelize operations across threads and grids. In the context of multi-agent RL, it makes very good sense to associate each replica of the environment to a unique block and each agent to a unique thread in the block. Accordingly, we can use the built-in CUDA variables:\n",
    "\n",
    "`int env_id = blockIdx.x;`\\\n",
    "`int agent_id = threadIdx.x;`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "617990d4",
   "metadata": {},
   "source": [
    "### Array indexing"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90ff9c5b",
   "metadata": {},
   "source": [
    "An important point to understand is how multi-dimensional arrays are indexed inside CUDA. Remember that CUDA stores arrays in a C-contiguous ([row-major](https://en.wikipedia.org/wiki/Row-_and_column-major_order)) fashion (and so does Pytorch with its tensors). Accordingly, for the element at location $[i,j]$ in a data array of shape $(L,M)$, the corresponding index on CUDA is $i*M + j$.\n",
    "\n",
    "In general, for a $d$-dimensional array of shape $(N_1, N_2, \\ldots, N_d)$, the memory-offset for the element at index $(n_1, n_2, \\ldots, n_d)$ is\n",
    "$$n_d + N_d \\cdot (n_{d-1} + N_{d-1} \\cdot (n_{d-2} + N_{d-2} \\cdot (\\cdots + N_2 n_1)\\cdots)))\n",
    "= \\sum_{k=1}^d \\left( \\prod_{\\ell=k+1}^d N_\\ell \\right) n_k$$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ffde402d",
   "metadata": {},
   "source": [
    "Now, let's write a simple function to add one to each element of the pushed data. We will perform this operation in parallel on the (num_envs) number of GPU blocks and the (num_agents) number of threads within.\n",
    "\n",
    "In general, the operation is (almost) parallel. Going into a bit more detail - CUDA employs a Single Instruction Multiple Thread (SIMT) architecture to manage and execute threads in groups of 32 called warps. So, as long as the number of agents is a multiple of 32, all the threads ar utilized, otherwise few threads remain idle. For example, if we use $1000$ agents, $24$ threads will remain idle, for a utilization rate of $97.65\\%$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e28ded4",
   "metadata": {},
   "outputs": [],
   "source": [
    "source_code = \"\"\"\n",
    "// A function to demonstrate how to manipulate data on the GPU.\n",
    "// This function increments each the random data array we pushed to the GPU before.\n",
    "// Each index corresponding to (env_id, agent_id) in the array is incremented by \"agent_id + env_id\".\n",
    "// Everything inside the if() loop runs in parallel for each agent and environment.\n",
    "//\n",
    "extern \"C\"{\n",
    "    __global__ void cuda_increment(                               \n",
    "            float* data,                                  \n",
    "            int num_agents                                       \n",
    "    )                                                            \n",
    "    {                                                            \n",
    "        int env_id = blockIdx.x;                                 \n",
    "        int agent_id = threadIdx.x;                             \n",
    "        if (agent_id < num_agents){                              \n",
    "            int array_index = env_id * num_agents + agent_id;\n",
    "            int increment = env_id + agent_id;\n",
    "            data[array_index] += increment;\n",
    "        }                                                            \n",
    "    }   \n",
    "}\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba0abcbc",
   "metadata": {},
   "source": [
    "Notice that the keyword `__global__` is used on the increment function. Global functions are also called \"kernels\" - they are functions you may call from the host. There's also the keyword `__device__` for functions that cannot be called from the host, but may only be called from other device or global functions."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8529f436",
   "metadata": {},
   "source": [
    "Next, we use the `FunctionManager` API method **load_cuda_from_source_code()** to build and load the CUDA code.\n",
    "\n",
    "*Note: We only use the string-type source code for the purposes of exposition. In general, it's standard practice to have several standalone source codes written out in cuda (.cu) file, pre-compile them to a single binary (.cubin), and then use the `FunctionManager`'s **load_cuda_from_binary_file()**.* \n",
    "\n",
    "*Additionally, if we compile template source code (so that `num_agents` and `num_envs` can be used as macro variables at compile time), we can use the CUDA `Function<anager`'s **compile_and_load_cuda(template_header_file)**.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3b1eb1c7",
   "metadata": {},
   "outputs": [],
   "source": [
    "cuda_function_manager.load_cuda_from_source_code(\n",
    "    source_code,\n",
    "    default_functions_included=False\n",
    ")\n",
    "cuda_function_manager.initialize_functions([\"cuda_increment\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f13816c",
   "metadata": {},
   "source": [
    "We will use the `FunctionManager`'s API method **_get_function()** to load the CUDA kernel function and get an handle to invoke it from the host device."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a8541bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "increment_function = cuda_function_manager._get_function(\"cuda_increment\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a904022a",
   "metadata": {},
   "source": [
    "Now, when invoking the `increment` function, along with the `data` and `num_agents` arguments, we also need to provide the block and grid arguments. These are also attributes of the CUDA `FunctionManager`: simply use\\\n",
    "\n",
    "- `block=cuda_function_manager.block`, and\n",
    "- `grid=cuda_function_manager.grid`\n",
    "\n",
    "Also, since we need to use the `num_agents` parameter, we also need to push it to the device. Instead of using a `DataFeed`, we may also push as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b008ec1b",
   "metadata": {},
   "outputs": [],
   "source": [
    "cuda_data_manager.push_data_to_device({\n",
    "    \"num_agents\": {\n",
    "        \"data\": num_agents,\n",
    "        \"attributes\": {\n",
    "            \"save_copy_and_apply_at_reset\": False,\n",
    "            \"log_data_across_episode\": False\n",
    "        }\n",
    "    }\n",
    "})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a6ff098f",
   "metadata": {},
   "outputs": [],
   "source": [
    "increment_function(\n",
    "    cuda_data_manager.device_data(\"random_data\"),\n",
    "    cuda_data_manager.device_data(\"num_agents\"),\n",
    "    block=cuda_function_manager.block,\n",
    "    grid=cuda_function_manager.grid    \n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "754115ad",
   "metadata": {},
   "source": [
    "Below is the original (random) data that we pushed to the GPU:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2190e93",
   "metadata": {},
   "outputs": [],
   "source": [
    "random_data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76eb38aa",
   "metadata": {},
   "source": [
    "and here's the incremented data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24b27d21",
   "metadata": {},
   "outputs": [],
   "source": [
    "cuda_data_manager.pull_data_from_device(\"random_data\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c3d770f",
   "metadata": {},
   "source": [
    "As expected, this method incremented each entry at index `(env_id, agent_id)` of the original data by `(env_id + agent_id)`! The differences are below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c38349da",
   "metadata": {},
   "outputs": [],
   "source": [
    "cuda_data_manager.pull_data_from_device(\"random_data\") - random_data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b539731",
   "metadata": {},
   "source": [
    "And we can invoke the increment function again to increment one more time (also in-place on the GPU), and the differences double."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8465da3c",
   "metadata": {},
   "outputs": [],
   "source": [
    "increment_function(\n",
    "    cuda_data_manager.device_data(\"random_data\"),\n",
    "    cuda_data_manager.device_data(\"num_agents\"),\n",
    "    block=cuda_function_manager.block,\n",
    "    grid=cuda_function_manager.grid    \n",
    ")\n",
    "cuda_data_manager.pull_data_from_device(\"random_data\") - random_data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae1fa078",
   "metadata": {},
   "source": [
    "### Validating CUDA parallelism"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1df76f53",
   "metadata": {},
   "source": [
    "We put all the pieces introduced so far together, and record the times for parallelized operations with different `num_envs` and `num_agents` settings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "db7f4f3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from timeit import Timer\n",
    "\n",
    "def push_random_data_and_increment_timer(\n",
    "    num_runs=1,\n",
    "    num_envs=2,\n",
    "    num_agents=3,\n",
    "    source_code=None\n",
    "):\n",
    "   \n",
    "    assert source_code is not None\n",
    "\n",
    "    # Initialize the CUDA function manager \n",
    "    cuda_function_manager = CUDAFunctionManager(\n",
    "        num_agents=cuda_data_manager.meta_info(\"n_agents\"),\n",
    "        num_envs=cuda_data_manager.meta_info(\"n_envs\")\n",
    "    )\n",
    "\n",
    "    # Load source code and initialize function\n",
    "    cuda_function_manager.load_cuda_from_source_code(\n",
    "        source_code,\n",
    "        default_functions_included=False\n",
    "    )\n",
    "    cuda_function_manager.initialize_functions([\"cuda_increment\"])\n",
    "    increment_function = cuda_function_manager._get_function(\"cuda_increment\")\n",
    "        \n",
    "    def push_random_data(\n",
    "        num_agents,\n",
    "        num_envs\n",
    "    ):\n",
    "        # Initialize the CUDA data manager\n",
    "        cuda_data_manager = CUDADataManager(\n",
    "            num_agents=num_agents,\n",
    "            num_envs=num_envs,\n",
    "            episode_length=100\n",
    "        )\n",
    "\n",
    "        # Create random data\n",
    "        random_data = np.random.rand(num_envs, num_agents)\n",
    "\n",
    "        # Push data from host to device\n",
    "        data_feed = DataFeed()\n",
    "        data_feed.add_data(\n",
    "            name=\"random_data\",\n",
    "            data=random_data,\n",
    "        )\n",
    "        data_feed.add_data(\n",
    "            name=\"num_agents\",\n",
    "            data=num_agents\n",
    "        )\n",
    "        cuda_data_manager.push_data_to_device(data_feed)\n",
    "\n",
    "    def increment_data():\n",
    "        increment_function(\n",
    "            cuda_data_manager.device_data(\"random_data\"),\n",
    "            cuda_data_manager.device_data(\"num_agents\"),\n",
    "            block=cuda_function_manager.block,\n",
    "            grid=cuda_function_manager.grid    \n",
    "        )\n",
    "        \n",
    "    data_push_time = Timer(lambda: push_random_data(num_agents, num_envs)).timeit(number=num_runs)\n",
    "    program_run_time = Timer(lambda: increment_data()).timeit(number=num_runs)\n",
    "\n",
    "    return {\n",
    "        \"data push times\": data_push_time,\n",
    "        \"code run time\": program_run_time\n",
    "    }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b2d2db1",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "\n",
    "num_runs = 100\n",
    "times = {}\n",
    "\n",
    "for scenario in [\n",
    "    (1, 1),\n",
    "    (1, 100),\n",
    "    (1, 1000),\n",
    "    (100, 1000),\n",
    "    (1000, 1000)\n",
    "]:\n",
    "    num_envs, num_agents = scenario\n",
    "    times.update(\n",
    "        {\n",
    "            f\"envs={num_envs}, agents={num_agents}\":\n",
    "            push_random_data_and_increment_timer(\n",
    "                num_runs,\n",
    "                num_envs,\n",
    "                num_agents,\n",
    "                source_code\n",
    "            )            \n",
    "        }\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a9b0934d",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Times for {num_runs} function calls\")\n",
    "print(\"*\"*40)\n",
    "for key, value in times.items():\n",
    "    print(f\"{key:30}: mean data push times: {value['data push times']:10.5}s,\\t mean increment times: {value['code run time']:10.5}s\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21334661",
   "metadata": {},
   "source": [
    "As we increase the number of environments and agents, the data size becomes larges, so pushing data becomes slower, but since all the threads operate in parallel, the time taken in the increment function remains about the same!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6cb897cb",
   "metadata": {},
   "source": [
    "And that's it! By using building blocks such as the increment function, we can create arbitrarily complex functions in CUDA C. For some comparative examples, please see the example environments that have both Python implementations in `examples/envs` and corresponding CUDA C implementations in `src/envs`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90f9a076",
   "metadata": {},
   "source": [
    "Below are some useful starting resources for CUDA C programming:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90c986b0",
   "metadata": {},
   "source": [
    "- [CUDA tutorial](https://cuda-tutorial.readthedocs.io/en/latest/)\n",
    "- [Learn C](https://learnxinyminutes.com/docs/c/)\n",
    "- [CUDA Quick Reference](http://www.icl.utk.edu/~mgates3/docs/cuda.html)\n",
    "<!-- - [Thrust](https://developer.nvidia.com/thrust). Note: thrust is a flexible, high-level interface for GPU programming that greatly enhances developer productivity. -->"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e451b38",
   "metadata": {},
   "source": [
    "# Learn More and Explore our Tutorials!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4712e2d3",
   "metadata": {},
   "source": [
    "This is the first tutorial on WarpDrive. To learn more, take a look at the following tutorials:\n",
    "\n",
    "- [WarpDrive sampler](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-2-warp_drive_sampler.ipynb)\n",
    "- [WarpDrive resetter and logger](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-3-warp_drive_reset_and_log.ipynb)\n",
    "- [Create custom environments](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-4-create_custom_environments.ipynb)\n",
    "- [Training with WarpDrive](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-5-training_with_warp_drive.ipynb)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
